Using data analysis techniques, I built a picture of how the profitability of short-term rentals varies across New York City at the zip code level. With these results, the client can better understand the market and assess risk, and can support decisions under different scenarios and timelines. An interactive Shiny app is also provided for more intuitive exploration.
setwd("~/airbnb-zillow-data-challenge-master")
# replace with your own working directory
airbnb <- fread('listings.csv', header = T, na.strings = c(""))
# on Windows the download may be 'listings.csv.gz'; fread can read it directly
# if the R.utils package is installed
zillow <- fread('Zip_Zhvi_2bedroom.csv',header=T,na.strings=c(""))
The cost and revenue datasets are cleaned separately to improve data quality before exploratory data analysis.
The revenue data mixes property details (address, zip code, bedrooms, bathrooms), information about the host, and daily/weekly/monthly stay prices.
Check the dimensions of the revenue data, its summary, and the number of NAs and unique values per column.
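These checks can be sketched as follows (on a toy table here; in the report they run on the `airbnb` table loaded with `fread('listings.csv')`):

```r
library(data.table)

# toy stand-in for the revenue data; in the report these checks run on the
# `airbnb` table loaded with fread('listings.csv')
airbnb_chk <- data.table(id = 1:4,
                         zipcode = c("10001", NA, "10003", "10001"),
                         price = c("$100", "$250", NA, "$80"))

dim(airbnb_chk)                              # rows x columns
na_counts <- colSums(is.na(airbnb_chk))      # NAs per column
unique_counts <- sapply(airbnb_chk, uniqueN) # distinct values per column (NA counts as a value)
```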
Remove the columns that start with “require” or “host” and the columns that end with “url” or “nights”, as these columns are not relevant to the problem at hand. 28 columns are removed.
pattern <- colnames(
  airbnb %>% select(
    starts_with("require"),
    starts_with("host"),
    ends_with("url"),
    ends_with("nights")
  )
)
airbnb1 <- airbnb[, (pattern) := NULL]
pattern
## [1] "requires_license" "require_guest_profile_picture"
## [3] "require_guest_phone_verification" "host_id"
## [5] "host_url" "host_name"
## [7] "host_since" "host_location"
## [9] "host_about" "host_response_time"
## [11] "host_response_rate" "host_acceptance_rate"
## [13] "host_is_superhost" "host_thumbnail_url"
## [15] "host_picture_url" "host_neighbourhood"
## [17] "host_listings_count" "host_total_listings_count"
## [19] "host_verifications" "host_has_profile_pic"
## [21] "host_identity_verified" "listing_url"
## [23] "thumbnail_url" "medium_url"
## [25] "picture_url" "xl_picture_url"
## [27] "minimum_nights" "maximum_nights"
Character columns with near 100% unique values (nearly every row is different) are removed, as they provide no group-level information that can be used at a larger scale. These include free-text columns describing the home, host, amenities, etc.
uniquedf <- airbnb1 %>% select_if(is.character) %>% summarise_all(n_distinct)
uniquedf <- uniquedf %>% gather(key = var_name, value = value, 1:ncol(uniquedf))
uniquedf$percentUnique <- round(uniquedf$value/nrow(airbnb1),2)
uniqueval <- uniquedf %>% filter(percentUnique > 0.2) %>% pull(var_name)
airbnb2 <- airbnb1[, (uniqueval) := NULL]
uniqueval
## [1] "name" "summary"
## [3] "space" "description"
## [5] "neighborhood_overview" "notes"
## [7] "transit" "access"
## [9] "interaction" "house_rules"
## [11] "amenities"
Get the metrics for variables that have near-zero variance using the caret package and remove them.
zvdf <- nearZeroVar(airbnb2, saveMetrics = TRUE)
ZVnames <- rownames(subset(zvdf, nzv == TRUE))
airbnb3 <- airbnb2[ , (ZVnames) := NULL]
ZVnames
## [1] "scrape_id" "experiences_offered" "state"
## [4] "market" "country_code" "country"
## [7] "bed_type" "has_availability" "license"
Finally, 14 more columns are removed manually after going through the dataset’s data dictionary, based on their relevance to the business problem.
Guided by six aspects of data quality, several cleaning steps are applied to the important columns.
After data cleaning, the dataset has 149 unique zip codes.
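A quick way to verify such a count (a sketch on a toy table; in the report the check runs on the cleaned `revenue` table):

```r
library(data.table)

# toy example; in the report this check runs on the cleaned `revenue` table
revenue_chk <- data.table(zipcode = c("10001", "10003", "10001", "11215"))
uniqueN(revenue_chk$zipcode)  # number of distinct zip codes
```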
############################################################################################# Zipcode
# 206 unique zipcodes, 567 missing
# replace missing with lat and long since there are no NA for lat and long
# 1. select NA values
zipcode.na <- revenue[,c('id','zipcode','latitude','longitude')]
zipcode.na <- zipcode.na[is.na(zipcode.na$zipcode), ]
# 2. generate zip code based on lat and long
result <- do.call(rbind,
lapply(1:nrow(zipcode.na),
function(i)revgeocode(as.numeric(zipcode.na[i,4:3])))) # takes almost 5 mins to run
# in case it cannot run, I also put the result in the folder
# save(result, file="result.Rdata")
# load("~(the working directory)/result.Rdata") #please run this line if the above code doesn't work properly.
setDT(zipcode.na)
zipcode.na[, result := result]
zipcode.na$zipcode_new <- substr(str_extract(zipcode.na$result," [0-9]{5}, .+"),2,6)
# check if there's still na
zipcode.na[is.na(zipcode.na$zipcode_new),]
## id zipcode latitude longitude
## 1: 4896855 <NA> 40.83314 -73.91888
## result zipcode_new
## 1: Grand Concourse/MC Clellan St, The Bronx, NY, USA <NA>
# <NA> Grand Concourse/MC Clellan St, The Bronx, NY, USA -> 10452
zipcode.na[is.na(zipcode.na$zipcode_new),]$zipcode_new <- '10452'
# keep id and zipcode to merge with revenue table and replace the NAs
zipcode.na <- zipcode.na[,-c(2:5)]
revenue <- left_join(revenue, zipcode.na, by = "id")
setDT(revenue)
revenue[is.na(zipcode), zipcode := zipcode_new]
revenue[, zipcode_new := NULL]
# check if theres still NA in zip code
revenue[is.na(revenue$zipcode),]
## Empty data.table (0 rows) of 33 cols: id,last_scraped,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed...
There is no NA left in the zip code column.
# Unify zip codes to 5 digits
# A valid zip code has exactly 5 characters; check for invalid lengths
# For length > 5 (e.g. ZIP+4 entries), keep the first 5 characters
revenue[nchar(zipcode) > 5, zipcode := substr(zipcode, 1, 5)]
# for <5, convert lat and long to zipcode
# 1. select <5 values
zipcode5 <- revenue[,c('id','zipcode','latitude','longitude')]
zipcode5 <- zipcode5[nchar(zipcode5$zipcode) < 5, ]
# 2. generate zip code based on lat and long
result2 <- do.call(rbind,
lapply(1:nrow(zipcode5),
function(i)revgeocode(as.numeric(zipcode5[i,4:3])))) # takes about one minute to run
setDT(zipcode5)
zipcode5[, result := result2]
zipcode5$zipcode_new <- substr(str_extract(zipcode5$result," [0-9]{5}, .+"),2,6)
zipcode5 <- zipcode5[,-c(2:5)]
revenue <- left_join(revenue, zipcode5, by = "id")
setDT(revenue)
revenue[nchar(revenue$zipcode) < 5, zipcode := zipcode_new]
revenue[, zipcode_new := NULL]
# check if the length all converts to 5 digits
revenue[nchar(revenue$zipcode) != 5,]
## Empty data.table (0 rows) of 33 cols: id,last_scraped,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed...
Most properties in the data are one-bedroom homes/apartments or one-bedroom private rooms. Based on the assumption given in the problem statement, I kept two-bedroom properties here. In the next step I doubled the price for private rooms. Keep in mind that the result then includes prices for two-bedroom private rooms, which should be lower than those for an entire home/apt, so the price is somewhat underestimated.
############################################################################################# Bedrooms
revenue %>%
group_by(room_type,bedrooms) %>%
summarise(no_properties = n())
## # A tibble: 21 x 3
## # Groups: room_type [?]
## room_type bedrooms no_properties
## <chr> <int> <int>
## 1 Entire home/apt 0 3369
## 2 Entire home/apt 1 10104
## 3 Entire home/apt 2 4593
## 4 Entire home/apt 3 1349
## 5 Entire home/apt 4 340
## 6 Entire home/apt 5 81
## 7 Entire home/apt 6 27
## 8 Entire home/apt 7 8
## 9 Entire home/apt 8 7
## 10 Entire home/apt 9 2
## # ... with 11 more rows
revenue1 <- revenue[bedrooms==2,]
# 23896 obs
############################################################################################# Price
# remove the dollar and comma sign and change it into numeric for further calculation
revenue1$price <- gsub('\\$', '', revenue1$price)
revenue1$price_new <- as.numeric(gsub(",", "", revenue1$price))
revenue1[room_type == 'Private room', price_new := price_new*2]
############################################################################################# date format
revenue1$last_scraped <- as.Date(revenue1$last_scraped,format = '%Y-%m-%d')
revenue1$first_review <- as.Date(revenue1$first_review,format = '%Y-%m-%d')
revenue1$last_review <- as.Date(revenue1$last_review,format = '%Y-%m-%d')
First, subset the dataset by filtering on the city name “New York”. The subset contains 25 unique zip codes from four boroughs: Brooklyn, Manhattan, Queens, and Staten Island.

* County name
  + Convert it to borough names for further analysis
* Quality check:
  + The median price has missing values from 1996-04 to 2007-05.
  + There is no duplication in the cost dataset.
After data cleaning, there are 25 unique observations, each representing a zip code in the dataset.
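The subsetting step can be sketched as follows (on a toy table; the report applies the same filter to the real `zillow` data, which has `City` and `CountyName` columns):

```r
library(data.table)

# toy stand-in for the Zillow table; the real `zillow` data has the
# same `City` and `CountyName` columns
zillow_chk <- data.table(RegionName = c(10003, 60614, 11215),
                         City = c("New York", "Chicago", "New York"),
                         CountyName = c("New York", "Cook", "Kings"))

zillow.ny_chk <- zillow_chk[City == "New York"]  # keep New York City rows only
```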
Based on the assumption that all properties and all square feet within each locale are homogeneous, I created a column giving the borough each zip code belongs to.
# change the column 'CountyName' to 'neighbourhood_group_cleansed'
colnames(zillow.ny)[5] <- 'borough'
# convert the county name to borough name based on definition
zillow.ny$borough <- ifelse(zillow.ny$borough %in% c('Queens', 'Bronx'), zillow.ny$borough,
                     ifelse(zillow.ny$borough == 'Kings', 'Brooklyn',
                     ifelse(zillow.ny$borough == 'New York', 'Manhattan', 'Staten Island')))
# for rShiny
# keep price from 2007-2017
df1 <- zillow.ny[,c(1,5,141:261)]
# reshape wide format to long format
price <- df1 %>%
gather('time', 'price',"2007-06":"2017-06")
price$time2 <- gsub("[: -]", "" , price$time, perl=TRUE)
# change format of time
price$time <- as.Date(paste(price$time,"-01",sep=""))
# for rShiny app
write.csv(price,file = 'cost.csv')
# use the central moving average to represent the cost
zillow.ny1 <- zillow.ny[,c('RegionName','borough',
'2017-02','2017-03','2017-04','2017-05','2017-06')] %>%
mutate(ma_5 = (`2017-02`+`2017-03`+`2017-04`+`2017-05`+`2017-06`)/5)
# merge two datasets (inner join)
alldt <- merge(revenue1, zillow.ny1, by.x = 'zipcode', by.y = 'RegionName')
Even after capping, the distribution still appears to be influenced by outliers.
Staten Island and Queens have lower prices and narrower distributions, partly because of their limited sample sizes.
Manhattan has the widest price distribution.
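The capping step itself is not shown in the code; one common approach is to winsorize the nightly price at fixed percentiles (a sketch on a toy price vector, with the 1st/99th percentiles as an assumed choice):

```r
# toy nightly prices with one extreme outlier; in the report this
# would be applied to alldt$price_new
price <- c(40, 80, 100, 120, 150, 5000)

# cap values below the 1st percentile and above the 99th percentile
caps <- quantile(price, probs = c(0.01, 0.99))
price_capped <- pmin(pmax(price, caps[[1]]), caps[[2]])
```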
Remove “fake listings”: 17 rows are removed.
fakelistings <- alldt[((alldt$last_review < as.Date('2015-05-01')) &
(alldt$availability_30 == 0) &
(alldt$availability_60 == 0) &
(alldt$availability_90 == 0) &
(alldt$availability_365 == 0)),]
# remove the listings
alldt <- alldt[((alldt$last_review >= as.Date('2015-05-01')) |
is.na(alldt$last_review)|
(alldt$availability_30 != 0) |
(alldt$availability_60 != 0) |
(alldt$availability_90 != 0) |
(alldt$availability_365 != 0)),]
Given the observations above and the assumption I have for the time, I used centered moving average to represent the property price in April 2017 in order to reduce the noise and uncover patterns in the data. In particular, I calculated a 5-point moving average. The code can be found right before the merge code. \(y_t = \frac{y_{t-2} + y_{t-1} + y_t + y_{t+1} + y_{t+2}}{5}\)
Occupancy rate for a short-term vacation rental is the number of booked nights divided by the sum of the available nights and booked nights.
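As a numeric illustration of this definition (the specific night counts here are made up): a listing booked for 45 of the last 60 nights has an occupancy rate of 45 / (15 + 45) = 0.75.

```r
booked_nights    <- 45
available_nights <- 15  # available but not booked

# occupancy rate = booked nights / (available nights + booked nights)
or_example <- booked_nights / (available_nights + booked_nights)
or_example  # 0.75
```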
# construct occupancy rate (assumed constant at 75% for all listings)
alldt$or <- 0.75
# get the year for first review
alldt$year <- substr(alldt$first_review,1,4)
I used one year as the unit of time and calculated the annual revenue from the occupancy rate and the median price for each zip code.
I chose three metrics to evaluate the profitability of each zip code, as they focus on different aspects of the investment: first, Return on Investment (ROI), which measures the efficiency of the investment; second, the break-even period, which measures how long it takes to pay back the initial cost; third, the total number of reviews, which proxies the demand in the zip code. I included three metrics because “there is no single metric that will give all the information you need to make the best choices possible about real estate investment.” In addition, since I assume the occupancy rate to be the same across zip codes, which is not realistic, it is important to include a metric that reflects the actual demand in each zip code.
\(ROI_i = \frac{Current Investment Value-Cost of Investment}{Cost of Investment} = \frac{Revenue_i+Appreciation Value_i-Purchase Price}{Cost of Investment}\)
in which \(Revenue_i\) represents the total revenue generated up to year \(i\), and \(Appreciation Value_i\) represents the total appreciation of the property up to year \(i\). Given the time limitation, I only consider \(Revenue_i\) for now. I set \(i\) to 1, 5, 10, 15, 20, 25, 30, 35, and 40 so that I could analyze the profitability of each zip code over different time frames. I then divided the 40-year ROI by 40 to get an annualized ROI rate for easier comparison between zip codes.
\(Annual Net Operating Income (NOI) = 365 × Occupancy Rate × Daily Revenue\)
\(Breakeven Period = \frac{Initial Cost}{Annual NOI}\)
rev_peryear: calculate revenue per year
beperiod: calculate the break-even period

# create median price for each zip code
all_zp <- alldt %>%
group_by(zipcode) %>%
dplyr::summarize(neighbourhood = paste(unique(neighbourhood_group_cleansed)),
med_rev = median(price_new),
mean_rev = mean(price_new), cost = mean(ma_5), or = mean(or),
tot_review = sum(number_of_reviews),
ave_review = mean(number_of_reviews),n=n(),
long = round(mean(longitude),5),lat=round(mean(latitude),5)
)
all_zp <- as.data.frame(all_zp)
#all_zp <- merge(all_zp,df1[,-2],by.x='zipcode',by.y='RegionName')
all_zp$rev_peryear <- 0
all_zp$beperiod <- 0
# revenue generated per year is time * occupancy rate * median price
all_zp$rev_peryear <- 365*all_zp$or*all_zp$med_rev
# cap rate equals to annual noi / total cost
all_zp$cr <- all_zp$rev_peryear/all_zp$cost*100
# break-even period equals total cost / annual NOI (the inverse of the cap rate)
all_zp$beperiod <- all_zp$cost/all_zp$rev_peryear
# profitability: ROI in year 1, year 5 to year 40
all_zp <- all_zp %>% mutate(year_01 = -(cost - rev_peryear)/cost,
year_05 = -(cost - 5*rev_peryear)/cost,
year_10 = -(cost - 10*rev_peryear)/cost,
year_15 = -(cost - 15*rev_peryear)/cost,
year_20 = -(cost - 20*rev_peryear)/cost,
year_25 = -(cost - 25*rev_peryear)/cost,
year_30 = -(cost - 30*rev_peryear)/cost,
year_35 = -(cost - 35*rev_peryear)/cost,
year_40 = -(cost - 40*rev_peryear)/cost,
annual_roi = year_40/40)
# create data table for rShiny
top_n <- data.frame(matrix(ncol = 0, nrow = nrow(all_zp)))
top_n$return_on_investment <- all_zp[order(-all_zp$year_01),]$zipcode
top_n$break_even_period <- all_zp[order(all_zp$beperiod),]$zipcode
top_n$total_reviews <- all_zp[order(-all_zp$tot_review),]$zipcode
I didn’t use availability to forecast the occupancy rate because hosts update their properties’ availability at different times.
Combine the metrics and rank the zip codes by each one, as in the table below.
| return_on_investment | break_even_period | total_reviews |
|---|---|---|
| 10312 | 10312 | 10003 |
| 10304 | 10304 | 10036 |
| 11434 | 11434 | 11215 |
| 10305 | 10305 | 11217 |
| 10306 | 10306 | 10025 |
| 11234 | 11234 | 10014 |
| 10036 | 10036 | 10013 |
| 10025 | 10025 | 10011 |
| 10308 | 10308 | 11201 |
| 11231 | 11231 | 11231 |
Tidying up the profit summary data for visualization. The plot below shows how ROI changes over time for each zip code. By year 40, the ROIs for all zip codes are above zero, and 10312 has the highest ROI, which aligns with the figure above.